Abstract


This  final  report  documents  the  results  of  a  five-year  investigation  of  methods  for  achieving 
higher  performance  for  knowledge-based  systems  through  the  design  of  innovative 
software and hardware systems architectures. Volume 1 summarizes the work performed
and  lessons  learned,  and  serves  as  an  annotated  index  to  the  set  of  over  50  project  technical 
reports.  Volumes  2  through  4  contain  the  project  technical  reports. 

1. Introduction

The  Expert  Systems  on  Multiprocessor  Architectures  (ESMA)  project  was  initiated  in 
March  1985,  and  technical  work  was  completed  in  1990.  The  research  was  conducted  at 
Stanford  University's  Knowledge  Systems  Laboratory.  The  results  and  findings  of  the 
project were published in a series of technical reports, which comprise Volumes 2 through 4 of this
Final Report. Volume 1 sets forth the basic concepts that underlie the research, and
provides  a  road  map  to  guide  the  reader  through  that  technical  literature.  Volume  1  ends 
with  a  project  bibliography,  which  serves  as  a  table  of  contents  for  Volumes  2  through  4. 
ESMA  builds  upon  and  straddles  a  number  of  areas  of  research  in  computer  science, 
including  artificial  intelligence,  programming  languages,  operating  systems, 
communication  protocols  and  hardware.  Prior  to  this  project,  some  work  was  done  on 
analyzing  the  performance  of  rule-based  systems  on  parallel  architectures,  most  notably  by 
Gupta  [Gupta  86].  On  the  hardware  side,  there  are  commercially  available  machines  that 
are  similar  in  some  respects  to  the  architectures  considered  here,  notably  the  Ametek 
machine.  The  relationship  of  ESMA  to  work  in  other  fields  is  documented  copiously  in  the 
papers  cited  below. 

On the other hand, this investigation is unique: the project focussed on applications
characterized by symbolic (largely non-numerical) computation; took an end-to-end
multi-level approach toward identifying and exploiting concurrency; and used highly
instrumented  simulation  to  permit  careful  analysis  of  experimental  results. 

In  the  remainder  of  this  chapter  we  set  forth  the  goals  of  the  project  and  list  the  personnel 
who  contributed  toward  achieving  those  goals.  Chapters  2  through  5  describe  and 
summarize  each  of  the  four  levels  of  analysis  in  our  multi-level,  vertical-slice  strategy. 
Chapter 6 draws together the principal conclusions and lessons that were learned from this
research.  Chapter  7  is  a  full  bibliography  of  the  technical  reports  that  were  produced  by 
project  staff.  Chapter  8  lists  other  referenced  works. 

1.1.  Project  Goals 

The  project's  primary  goal  is  to  find  ways  to  increase  the  performance  of  expert  systems 
through  the  use  of  the  new,  emergent,  parallel  hardware  designs. 

The number of possible implementation strategies for such a project is huge. One has only
to look at the large number of different hardware designs that are emerging and at the
number of different problem-solving methods to see how combinatorial the problem would be if
we endeavored to investigate all of the reasonable and plausible combinations of
architectures. It was decided, therefore, that we could learn a great deal simply from making a
commitment to one, or at least a small number of different options at each point in the
system's make-up. We thus decided to take a "vertical slice" through the space of possible
solutions. Clearly we did not intend to investigate any options that seemed non-useful, so
we knew from the outset that, although we could not prove that we had the best design to
meet our goals, our design would nevertheless be at least a plausible architecture for a
future computational environment.

We viewed the task of implementing concurrent expert systems as one that was
split into a number of implementation layers. If we could achieve speed-up at each one of
these layers, then we could hope for a substantial overall performance improvement
compared to existing AI systems. Our model of the layers into which the project could be split
is shown in Figure 1.


Applications 


Problem-Solving  Frameworks 


Knowledge  Retrieval 


Resource  Management 


Programming  Languages 


Operating  Systems 


Systems  Architectures 


Figure 1. The layers of system implementation through which we hoped to
achieve computational speed-up in the project.

It  was  originally  anticipated  that  the  needs  of  the  applications  would  drive  the  development 
of  the  problem-solving  frameworks  and  so  on  down  through  the  implementation  hierarchy 
shown  in  Figure  1  until  eventually  the  hardware  would  be  designed  under  the  constraints 
passed down from above. In practice, however, this did not happen. Because of the
difficulty of finding and mounting an application suitable to our needs and the early availability
of personnel interested in the hardware design aspect, the hardware design went ahead
more rapidly than the other layers. This resulted in our designs being more hardware
driven than application driven. This approach has its advantages: for example, an entirely
top-down design process could easily have resulted in low-level system requirements
which were not implementable.

Not only did the thrust of the project come from the bottom rather than the top, but the levels of
abstraction actually implemented also differed significantly from those shown in Figure 1.
Figure  2  gives  a  more  realistic  representation  of  the  layers  that  were  actually  investigated, 
as  opposed  to  what  we  intended  to  do. 

Applications


Problem-Solving  Frameworks 


Resource  Management 


Programming  Languages 


Hardware Systems Architectures, Topologies and Protocols

Figure 2. The layers that were actually implemented in the project. Resource
Management is shown in small type because it was a recent addition and most of our
work was done without the help of this layer.

The Knowledge Systems Laboratory has considerably more expertise in software than in
hardware. We thus decided early on not to build any hardware - there are many other
research groups that could do this better than we. We decided, therefore, to simulate our
hardware. This would allow us to modify our software and hardware designs easily and
allow us to extract the maximum insight with the minimum effort.

The rest of this paper is split into sections which reflect the major layers shown in Figure 2.
In each of these sections the work of the relevant sub-projects will be discussed. Because
of the bottom-up thrust of the project, its components will be discussed in
bottom-up order. This will also reduce the number of forward references made, since
discussion of the higher layers will inevitably have to refer to the substrates on which they are
implemented.

1.2.  Personnel 

This project has employed a large number of people over the years and it seems appropriate
to  name  them  all  here: 

Ed  Feigenbaum,  Bob  Engelmore,  Penny  Nii,  Bruce  Delagi,  Harold  Brown,  Hiroshi 
Okuno,  John  Delaney,  Byron  Davies,  Hirotoshi  Maegawa,  Nelleke  Aiello,  James  Rice, 
Nakul  Saraiya,  Sayuri  Nishimura,  Eric  Schoen,  Greg  Byrd,  Max  Hailperin,  Russell 
Nakano,  Masafumi  Minami,  Chris  Rogers,  Alan  Noble,  Jean-Christophe  Bandini,  Manu 
Thapar,  Djuki  Muliawan,  Pandu  Nayak,  Jerry  Yan  and  Sam  Hahn. 

2.  Hardware-Level  Systems  Studies 

As was mentioned above, hardware system design led the way in the project. In this
section we discuss a little of the motivation for the hardware designs and briefly
describe both the current generation of hardware designs on which we are working and the
simulator we are using.

Figure 3. The Simple system provides a toolkit from which to build circuits to
be simulated, a collection of probes to connect to the circuit and a set of instruments to
connect to the probes.

The hub of all of the work done on the project has been the digital circuit simulator,1 upon
which everything else is built. This simulator is called Simple. It is an event-driven
simulator, designed to allow the user to design and specialize digital circuits in a simple and
modular way, using a circuit design tool called Helios. A sophisticated set of instrument
tools allows the user to design and specialize simulated probes which can be connected to
the circuit while it is running. A number of instruments can then be connected to the
probes, permitting the user to see the behavior of the circuit as it operates without
interfering with the behavior of the system. We like to view this model as one of a laboratory
workbench equipped with collections of instruments, probes and circuit building
components from which the user can build systems and on which the user can perform
quantitative experiments (see Figure 3).

Note: This simulator could be used to simulate circuits down to the gate level, but one of its powerful
attributes is its ability to allow the programmer to define the behavior of composite objects in terms of
methods that make these devices appear to be atomic black boxes. This ability obviates the need to do gate
level simulation of those aspects of the system whose behavior is well understood. This has enormous
benefits in terms of simulation time.

The key factors that make the Simple simulator so powerful are detailed in [Delagi 86b,
2-272], [Delagi 87, 2-294] and [Saraiya 90a, 4-360]. In effect, the simulator focuses on the
critical aspects of multiprocessor design, namely interprocessor communication and
topology. The simulation is less detailed in other areas. This allows the user to simulate
the  execution  of  sophisticated  problems,  rather  than  the  toy  problems  or  small  code 
fragments  possible  with  other  simulators.  The  instrumentation  in  the  simulator  is  powerful 
and  flexible,  not  only  allowing  the  user  to  observe  events  in  the  simulated  system  at 
multiple  levels  of  abstraction,  but  also  readily  allowing  the  user  to  modify  and  specialize 
instrumentation  so  as  to  focus  the  simulator  more  sharply  on  interesting  application-specific 
behavior.  This  allows  the  user  to  gain  substantial  insight  from  simulator  runs,  while  still 
allowing  the  user  to  reconfigure  the  system  easily  and  quickly  in  the  event  of  an  unexpected 
result  prompting  unplanned  experiments. 
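
The event-driven, probe-instrumented style of simulation described above can be illustrated with a minimal sketch. It is written in Python rather than the Lisp used by the project, and it is not the Simple simulator's interface: the names Simulator, Inverter, Output, attach_probe and demo are hypothetical, chosen only for this illustration. Components are modeled as atomic black boxes with a fixed delay, and probes observe every event without altering the behavior of the circuit.

```python
import heapq
from itertools import count


class Simulator:
    """Minimal event-driven simulator: events fire in timestamp order."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = count()  # tie-breaker keeps same-time events in FIFO order
        self._probes = []

    def schedule(self, delay, component, signal):
        entry = (self.now + delay, next(self._seq), component, signal)
        heapq.heappush(self._queue, entry)

    def attach_probe(self, probe):
        self._probes.append(probe)

    def run(self):
        while self._queue:
            self.now, _, component, signal = heapq.heappop(self._queue)
            for probe in self._probes:
                probe(self.now, component, signal)  # probes only observe
            component.receive(self, signal)


class Inverter:
    """Hypothetical black-box component: inverts its input after a fixed delay."""

    def __init__(self, name, delay, sink):
        self.name, self.delay, self.sink = name, delay, sink

    def receive(self, sim, signal):
        sim.schedule(self.delay, self.sink, not signal)


class Output:
    """Hypothetical terminal component that records the last signal seen."""

    def __init__(self, name):
        self.name = name
        self.value = None

    def receive(self, sim, signal):
        self.value = signal


def demo():
    sim = Simulator()
    out = Output("out")
    inv2 = Inverter("inv2", delay=3, sink=out)
    inv1 = Inverter("inv1", delay=2, sink=inv2)
    trace = []
    sim.attach_probe(lambda t, c, s: trace.append((t, c.name, s)))
    sim.schedule(0, inv1, True)
    sim.run()
    return trace, out.value
```

Calling demo() drives a signal through two inverters and returns the probe trace together with the final output value. Because each component is a black box with a delay, no gate-level detail is simulated, which is the trade-off the text describes.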

It was found early on that simulations of the sort we wanted to do would be
computationally very expensive. An experiment was performed, therefore, to parallelize the simulator
itself in an attempt to bring down the times taken for the simulations, which often exceeded
one day in duration. This resulted in AIDE, a distributed version of Simple [Saraiya 86,
4-297]. Unfortunately, we were unable to achieve any speed-up at all for our simulations,
largely  because  of  the  communication  bandwidth  and  latency  associated  with 
communicating  between  the  multiple  Symbolics  machines  we  were  using  via  an  Ethernet 
and  because  the  simulator,  being  event-driven,  required  frequent  synchronization  on  the 
event  queue,  which  serialized  the  processing.  Although  this  experiment  yielded  a  negative 
result,  it  was  valuable  in  demonstrating  the  importance  of  process  grain  size  and 
synchronization  effects. 

2.2.  CARE 

The  Simple  simulator  mentioned  above  was  used  to  design  and  build  what  we  refer  to  as 
the  CARE2  machine  and  simulation  system  [Delagi  88a,  2-301]  (see  Figure  4).  The  CARE 
machine  is  that  simulated  machine  on  which  all  of  the  experiments  mentioned  below  have 
been  performed.  The  machine's  design  has  a  few  key  features  which  are  worthy  of  note: 

• Dynamic cut-through routing with local flow control, in order to optimize network
throughput  [Byrd  87c,  2-155].  This  protocol  uses  special  packet  terminators  and 
selective  buffering  to  avoid  deadlock  during  multicasts. 

• Toroidal topology. Topology can be motivated by high-level, application domain
considerations,  but  it  is  also  motivated  by  such  low-level  concerns  as  packaging 
and  communication  protocols.  Cost  models  were  developed  to  characterize  several 
topologies  and  these  topologies  were  tested  under  simulation.  On  balance,  we 
believe  that  toroidally  connected  networks  have  the  best  overall  cost/benefit  tradeoff 
[Byrd  87b,  2-148]. 

•  Non-blocking  message  sending,  so  as  to  encourage  pipe-line  processing. 

•  Communications  network  with  alternative  paths  between  points,  so  as  to  reduce 
communications  problems  due  to  busy  communication  paths. 


• A separate communications controller, in order to support operating system
functions and to implement the non-blocking send functionality mentioned above. This
communications controller is referred to as the "Operator". The processor in each
processing element that executes user code is called the "Evaluator".

1 Citations for project reports point to the bibliography at the end of this volume and also to the page
number where the report can be found in Volumes 2 through 4.

2 The expansion for this acronym seems to have been lost somewhere in the wash. We think that it has
something to do with the words Concurrent and Array.

A simplified model of a CARE machine processing element (site) is shown in Figure 4.


Figure  4.  A  CARE  machine  processing  element  (site). 


These processing elements can be connected together in a number of ways, such as into
grids and bus-based networks, as shown in Figure 5. When a CARE site is used simply
as  a  memory  controller  its  evaluator  processor  is  not  used.  Similarly,  when  a  site  is  used 
just  as  a  processor  in  a  bus-based  shared-memory  machine,  only  the  evaluator  is  used. 

Figure 5. CARE sites can be connected together into a variety of distributed and
shared memory topologies. In this example we show a six-way connected grid and a
bus based machine.
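
To make the toroidal topology listed among the CARE design features more concrete, the following sketch computes neighbors and hop counts on a torus. It assumes a two-dimensional, four-way connected torus purely for illustration; the connectivity actually used in the CARE experiments may differ, and the function names are hypothetical.

```python
def torus_neighbors(x, y, width, height):
    """Return the four neighbors of site (x, y); edges wrap around."""
    return [
        ((x + 1) % width, y),
        ((x - 1) % width, y),
        (x, (y + 1) % height),
        (x, (y - 1) % height),
    ]


def torus_hops(a, b, width, height):
    """Minimum hop count between sites a and b on the torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # On each axis a message may travel either way around the ring.
    return min(dx, width - dx) + min(dy, height - dy)


def mesh_hops(a, b):
    """Hop count on a plain grid without wrap-around, for comparison."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])
```

On a 4 by 4 network, opposite corners are 6 hops apart on a plain grid but only 2 hops apart on the torus. The wrap-around links also give every pair of sites more than one route, which relates to the alternative-paths feature listed above.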

The work on the CARE sub-project has focussed mainly on the design of inter-processor
communication networks, as is appropriate. This has meant that we have been able to
ignore the instruction level behavior of the processors themselves. The application programs
that we run are merely timed as they run between the points at which code fragments cause
communication between processors. Being able to avoid doing register level simulation of
the processors themselves has allowed us to execute much more complex and realistic
programs on our simulated machines. We have therefore traded accuracy in our processor
simulation - assuming that the processing elements will behave much like existing Lisp
Machine processors - in favor of greater realism in terms of the system's performance
under the load of real programs.

A  number  of  aspects  of  system  design  have  not  been  addressed  in  detail  and  the  simulations 
do not take these into account. Most significant among these, perhaps, is the fact that
memory usage, code distribution and garbage collection are not simulated, i.e. the CARE
machine  was  assumed  to  have  unbounded  local  memory  and  code  was  assumed  to  have 
been  distributed  uniformly  to  all  processors  at  load  time.  Thus,  although  the  CARE 
architecture  was  designed  with  garbage  collection  in  mind,  this  was  not  simulated  at  all.  In 
fact,  all  possible  extraneous  impediments  to  accurate  and  reproducible  run-time 
measurement  were  eliminated.  Such  simulation  machine  system  overheads  as  garbage 
collection,  paging,  I/O  and  page  creation  were  carefully  factored  out  of  the  timing.  This 
resulted  in  timings  that  were  not  "realistic"  in  the  sense  that  they  did  not  account  for  certain 
necessary  system  behavior,  but  these  timings  were  nevertheless  far  more  useful  in  general 
because these system services are generally non-deterministic with respect to the simulation
and  their  behavior  is  a  function  of  the  performance  of  the  simulation  machine,  not  the 
simulated  machine. 

The  CARE/Simple  simulator  system  is  perhaps  the  most  valuable  tangible  product  of  the 
project.  It  is  now  being  used  in  a  number  of  research  departments,  both  corporate  and 
academic,  outside  Stanford.  Like  all  project  software,  it  is  in  the  public  domain. 
CARE/Simple will soon be available running under Common Lisp, CLUE and X11 on a
number  of  different  platforms. 

3. Operating Systems and Languages

A  considerable  amount  of  effort  has  been  spent  on  the  project  in  working  at  the  operating 
system  level  of  abstraction.  Because  our  experiments  dealt  with  a  single  task,  and  file 
system  issues  were  not  considered,  it  was  not  necessary  to  build  an  operating  system  per 
se.  The  CARE  machine  itself  features  a  dual  processor  for  each  processing  element.  This 
allows  much  of  the  work  of  an  operating  system,  particularly  inter-processor 
communication,  to  be  done  by  a  dedicated  processor  in  parallel  with  the  execution  of  user 
code.  The  behavior  of  this  communication  processor  is  coded  directly  into  the  simulated 
hardware. 

Work in this area has covered concurrent object-oriented systems, concurrent Lisp
dialects, programming models and resource allocation.

3.1.  CAREL 

CAREL  [Davies  86,  2-226]  was  one  of  the  first  programs  written  to  run  on  the  CARE 
simulated  machine.  It  was  an  early  attempt  to  find  a  Lisp  language  interface  to  the 
distributed-memory  hardware  provided  by  CARE.  It  took  as  its  basis  Scheme  [Abelson 
83]  and  QLisp  [Gabriel  84]  and  included  primitives  to  allow  remote  function  calls  and 
remote  consing.  It  was  quickly  found  that,  because  of  the  cost  of  process  creation,  it  was 
desirable  to  make  the  best  use  of  any  processes  that  were  spawned.  This  efficiency  was 
accomplished  by  storing  application  dependent  data  in  non-ephemeral  spawned  processes. 
State  of  this  type  was  implemented  in  CAREL  as  writable  closure  variables.  These  process 
closures  could  be  used  as  elements  in  pipe-line  computations  or  to  represent  mutable 
communicating  program  objects,  for  instance  to  represent  real-world  objects  with  state. 
State,  as  encapsulated  in  communicating  objects,  and  the  idea  of  pipe-line  parallelism  have 
been  pivotal  in  the  design  of  the  other  systems  developed  on  the  project. 
The CAREL project was used mostly as a feasibility study and was soon discontinued.
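
The idea of long-lived processes that hold state in writable closure variables and act as pipeline stages can be sketched as follows. This is an illustration in Python, not CAREL code: it runs sequentially, omits remote function calls entirely, and the names make_running_total, make_scaler and run_pipeline are hypothetical.

```python
def make_running_total():
    """Create a stage whose state lives in a writable closure variable."""
    total = 0

    def stage(value):
        nonlocal total  # the closure variable is mutated on every call
        total += value
        return total

    return stage


def make_scaler(factor):
    """Create a stateless stage that scales each value it receives."""

    def stage(value):
        return value * factor

    return stage


def run_pipeline(stages, inputs):
    """Feed each input through every stage in order (sequentially here)."""
    outputs = []
    for item in inputs:
        for stage in stages:
            item = stage(item)
        outputs.append(item)
    return outputs
```

Each stage is created once and reused for every item, which mirrors the motivation given above: process creation is costly, so a spawned process should be kept alive and carry its own state. For example, run_pipeline([make_scaler(2), make_running_total()], [1, 2, 3]) yields the running totals of the doubled inputs.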

3.2.  CAOS 

The  first  implementation  of  the  Elint  application,  described  further  in  Section  5.1,  was 
made  without  the  benefit  of  any  problem-solving  framework,  per  se,  but  rather  using  an 
object-oriented  programming  architecture.  It  was  anticipated  that  the  application  could  be 
easily  mounted  almost  directly  on  the  CARE  machine  and  some  experiments  could  be  run 
quickly,  which  would  allow  us  to  learn  some  important  lessons  early  in  the  project. 

In  order  to  mount  the  application,  a  distributed  object-oriented  system  was  implemented. 
This  was  done  because  the  CARE  system  did  not,  at  the  time,  come  with  its  own 
"preferred"  object  system.  The  system  that  was  implemented  was  called  CAOS  [Schoen 
86,  4-433],  a  Concurrent  Asynchronous  Object-oriented  System.  It  was  implemented 
using  the  Flavors  system  supported  by  the  Lisp  machines  used  by  the  project.  It  had  a 
number  of  key  features: 

•  CAOS  objects  were  dynamically  instantiable  and  potentially  multiprocess  objects, 
though  each  would  execute  on  a  single  processor,  having  at  least  one  stack  group 
associated  with  each  CAOS  object. 

•  CAOS  objects  were  intentionally  large  grained.  This  was  because  it  was  anticipated 
that  the  communications  network  would  be  the  resource  most  competed  for,  thus 
encouraging  the  programmer  to  perform  a  lot  of  computation  in  order  to  reduce  the 
number  or  size  of  messages  sent. 

•  Packet-based  message-passing  was  used  as  the  metaphor  for  communication 
between  processes  through  streams  in  the  language  extensions  to  Lisp  provided  by 
CAOS. 

•  A  large  number  of  different  message  sending  primitives  were  defined,  including 
non-blocking  sends  that  did  not  require  a  reply  from  the  target  of  the  message, 
sends  that  returned  futures  to  the  values  returned  by  the  targets  and  send  operations 
which  blocked  immediately  in  order  to  wait  for  a  reply  from  their  targets. 
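
The three kinds of send operation listed above can be sketched with Python's standard concurrent.futures module. This is an analogy only: the class ActiveObject and the method names post, send_future and send_and_wait are hypothetical and are not the CAOS primitives, and a single worker thread stands in for the one processor on which an object executes.

```python
from concurrent.futures import ThreadPoolExecutor


class ActiveObject:
    """Hypothetical object that handles one message at a time on its own worker."""

    def __init__(self):
        self._worker = ThreadPoolExecutor(max_workers=1)
        self.log = []

    def _handle(self, message):
        self.log.append(message)
        return message.upper()

    def post(self, message):
        """Non-blocking send; no reply is expected."""
        self._worker.submit(self._handle, message)

    def send_future(self, message):
        """Non-blocking send; returns a future for the eventual reply."""
        return self._worker.submit(self._handle, message)

    def send_and_wait(self, message):
        """Blocking send; waits immediately for the reply."""
        return self._worker.submit(self._handle, message).result()

    def shutdown(self):
        self._worker.shutdown(wait=True)
```

A caller that uses send_future can continue computing and call result() on the returned future only when the value is needed, whereas send_and_wait gives up that overlap in exchange for simplicity.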

Contrary  to  our  intuition,  the  communications  network  proved  to  be  the  least  loaded  of  the 
CARE  machine's  resources  during  our  experiments  on  CAOS.  The  computational  expense 
of  supporting  its  complex  object  model  caused  the  granularity  of  the  resultant  computations 
to  be  too  large.  However,  the  real-time  signal  interpretation  application  developed  in 
CAOS  focussed  our  attention  on  such  key  factors  as  decomposition  grain  size  and  the  use 
of  replicated  pipelines  of  processes.  Because  of  the  computational  expense  for  each 
process,  the  CAOS  model  was  inconsistent  with  a  large  number  of  processors  executing 
tightly  coupled  subproblems  typical  of  reasoning  systems. 

3.3.  LAMINA 

Lamina  is  the  object  system  that  was  designed  after  the  lessons  were  learned  from  the 
CAOS  experiments  [Saraiya  90b,  4-394].  It  was  originally  intended  to  provide  a  very 
small,  light-weight  layer  on  top  of  the  CARE  machine  so  that  distributed  object-oriented 
programs  could  be  implemented  efficiently.  A  significant  part  of  the  motivation  for  the 
design  of  Lamina  was  the  desire  to  reduce  the  overhead  suffered  by  the  CAOS  system  in 
terms  of  associating  large  stack  groups  with  each  of  the  CAOS  objects.  Lamina  introduced 
the  idea  of  objects  with  restartable,  rather  than  resumable  code  segments,  which  do  not 
require  stacks  to  preserve  their  state  when  they  are  not  running.  Since  its  first  appearance 
Lamina  has  been  developed  extensively  and,  although  still  small  and  light-weight,  provides 
a  platform  for  the  development  of  computational  models  for  functional  and  shared-variable 
as  well  as  object-oriented  programming. 
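
The distinction between resumable and restartable code can be illustrated with a small Python sketch. It is an analogy under stated assumptions, not Lamina's implementation: a Python generator stands in for a resumable process whose suspended frame must be preserved between messages, while a plain object with a handler method stands in for a restartable one. All names are hypothetical.

```python
def resumable_averager():
    """Resumable: execution state lives in the suspended frame between messages."""
    total, n = 0, 0
    average = None
    while True:
        value = yield average  # suspended here until the next message arrives
        total += value
        n += 1
        average = total / n


class RestartableAverager:
    """Restartable: the handler runs from the top for each message.

    All state that must survive between messages is held in object fields,
    so nothing stack-like needs to be kept while the object is idle.
    """

    def __init__(self):
        self.total = 0
        self.n = 0

    def handle(self, value):
        self.total += value
        self.n += 1
        return self.total / self.n
```

Both versions compute the same running average. The restartable form makes the retained state explicit and small, which is the property the text credits with removing the need for a stack per idle object.
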
[Delagi 86a, 2-243] not only shows that the three programming models - object-oriented,
shared-variable and functional - can all be implemented using Lamina's unifying
stream mechanism, but also shows, by example, how these programming models can be
used to create pipelines, how to manage these software pipelines and how structures can be
dynamically created and relocated using the Lamina model. This model also allows the
substantial localization of storage reclamation, which is a crucial factor in the development
of efficient, concurrent garbage collection mechanisms.

Lamina  has  been  used  to  implement  a  number  of  programs,  both  for  direct  implementations 
of  the  two  real-time  expert  systems  being  investigated  (see  Section  5),  AirTrac  and  Elint, 
and  also  a  number  of  numerical  programs.  Lamina  is  now  the  preferred  core  programming 
system  for  the  CARE  machine  and  applications  in  Lamina  have  consistently  shown  the 
highest  performance  of  all  programs  running  on  the  CARE  machine. 

3.4.  Inter-Processor  and  Inter-Process  Communication 

A considerable amount of work has been performed on the investigation of different
mechanisms for inter-processor and inter-process communication. For distributed-memory
machines we believe that the efficient distribution of work for large applications is crucially
linked to the efficient implementation of multicast communication [Byrd 87a, 2-116]. In
particular  we  have  concentrated  on  the  development  of  efficient  cut-through  routing 
methods  that  allow  the  effective  use  of  multicast.  In  [Byrd  88b,  2-196]  several  alternative 
cut-through  multicast  protocols  are  described  and  compared  experimentally.  One  particular 
adaptive  scheme  is  found  to  be  superior  to  the  others  investigated  both  in  performance  and 
in  the  fact  that  the  protocol  provides  cut-through  multicast  without  requiring  dedicated 
storage  in  the  communication  architecture  for  a  full  packet. 

Although the principal thrust of the project has been towards the development of
distributed-memory hardware, the fact that the CARE simulator can also simulate
shared-memory machines has allowed the investigation of the relative performance of these two
distinct classes of machines and the relative performance and appropriateness of
shared-variable and message-passing/object-oriented programming models. In [Byrd 88a, 2-181]
a particular parallel application is implemented in both object-oriented and shared-variable
styles. Using these examples it was possible to show how the differences in programming
model affected performance and what the costs associated with each model were. This then
allowed the identification of strategies for minimizing data communication costs in each of
these programming models.

Work  late  in  the  project  focussed  on  the  design  of  hardware  that  might  provide  efficient 
support  for  both  the  shared-variable  and  the  message-passing  programming  models, 
particularly  through  the  use  of  cut-through  multicast  protocols  [Byrd  89, 2-205]. 

3.5.  Load-Balancing 

We examined load-balancing problems within the context of the "vertical slice" (recall
Figure 2) [Hailperin 88, 3-1]. In particular, this work focussed on a load-balancing
method which migrates Lamina objects in a large network (thousands of processing
elements) of CARE processing elements in order to improve the performance of soft-real-time
signal-interpretation systems such as Elint and AirTrac (see Section 5).

Experiments  showed  that  without  special  attention  to  load  balancing,  performance  was 
seriously  degraded.  Without  load  balancing,  only  a  lightly-loaded  multicomputer,  which 
has  cause  to  create  processes  dynamically,  can  in  general  achieve  real-time  performance. 
The  studies  focussed  on  how  to  achieve  global  load  balancing,  which  would  be  an 
attractive solution to this problem, as it would allow the effective use of massively-parallel
ensemble  architectures  for  larger  soft-real-time  problems. 

The challenge is to replace quick global communication, which is impractical in a
massively-parallel system, with statistical techniques. In this vein, a novel approach to
decentralized load balancing was investigated based on statistical time-series analysis. Each
processing  element  estimates  the  system-wide  average  load  using  information  about  past 
loads  of  individual  sites  and  attempts  to  equal  that  average.  This  estimation  process  is 
practical  because  the  soft-real-time  systems  in  which  we  are  interested  naturally  exhibit 
loads  that  are  periodic,  in  a  statistical  sense  akin  to  seasonality  in  econometrics. 
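
The general idea of estimating a system-wide average from past, periodic load observations can be sketched as follows. This is a deliberately simple stand-in and not the estimator developed in the project: it assumes a known period and simply averages the loads previously observed at the same phase of the cycle. The class LoadEstimator and the function surplus are hypothetical names.

```python
from collections import defaultdict


class LoadEstimator:
    """Estimate the average load from past samples taken at the same phase.

    Assumption for this sketch: loads repeat with a known period, so samples
    observed at time t, t - period, t - 2 * period, ... are comparable.
    """

    def __init__(self, period):
        self.period = period
        self._samples = defaultdict(list)  # phase -> observed site loads

    def observe(self, time_step, site_load):
        """Record a load reported by some site at the given time step."""
        self._samples[time_step % self.period].append(site_load)

    def estimate_average(self, time_step):
        """Return the estimated average load, or None with no history."""
        samples = self._samples.get(time_step % self.period)
        if not samples:
            return None
        return sum(samples) / len(samples)


def surplus(own_load, estimate):
    """Load an element would shed to match the estimate (0 if at or below it)."""
    if estimate is None:
        return 0
    return max(0, own_load - estimate)
```

An element whose own load exceeds the estimate would be a candidate for migrating objects elsewhere; how objects are chosen and where they are sent is outside the scope of this sketch.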

A load-balancing system for Lamina/CARE was designed using this load-characterization
technique. Its implementation, and experimentation with it in the context of the Elint
and AirTrac applications, are the subject of a Ph.D. thesis in progress.

3.6.  Concurrent  and  High  Performance  Lisp 

In  an  attempt  to  understand  the  behavior  of  the  Lisp  language  on  shared  memory  machines, 
work  was  done  on  the  QLisp  system  [Okuno  87,  3-443].  Although  this  work  was  not 
used  directly  by  other  parts  of  the  project,  it  investigated  some  of  the  constraints  on 
parallelizing  production  systems  by  studying  the  OPS5  language.  This  was  the  first  large 
application  implemented  in  QLisp  and  it  was  found  that  QLisp  was  able  to  encode  all  of  the 
previously  found  sources  of  parallelism  in  OPS5,  which  amounted  to  a  proof  of  concept 
for  QLisp. 

3.7.  Distributed  Cache  Coherence 

A  significant  aspect  of  our  research  into  shared  memory  architectures  was  that  of  caching 
schemes  and  cache  coherence.  During  our  research  we  have  designed  and  developed  a  new 
scalable  cache  coherence  protocol  for  large  scale  shared  memory  architectures.  This 
protocol  has  lower  cost  and  more  robust  performance  than  previous  solutions. 

Cache  coherence  is  an  important  and  well  known  problem  in  shared  memory 
multiprocessors.  In  such  systems,  each  processor  has  an  associated  cache.  The  same  data 
may  be  shared  by  different  processors  and  thus  copies  of  the  data  may  be  present  in 
different  caches.  A  cache  coherence  mechanism  must  exist  in  order  to  keep  these  multiple 
copies  consistent  with  each  other. 

Bus-based  shared  multiprocessors  usually  provide  some  form  of  "snoopy"  cache  coherence 
protocol.  The  term  "snoopy"  arises  from  the  fact  that  on  a  write,  each  cache  watches  the 
addresses  transmitted  on  the  bus.  In  the  case  that  the  cache  has  a  copy  of  the  data,  it  is 
either  invalidated  or  updated.  Snoopy  cache  coherence  protocols  rely  on  the  bus  to  provide 
a global broadcast, and such systems are limited by the number of processors a bus can
support  before  it  saturates. 
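
A small sketch of the snooping idea follows. It models the invalidate variant only and uses write-through to memory for simplicity; both choices, and all class and method names, are assumptions made for illustration rather than a description of any particular machine.

```python
class SnoopyCache:
    def __init__(self, name, bus):
        self.name = name
        self.lines = {}          # address -> cached value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                 # miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.broadcast_write(self, addr, value)

    def snoop(self, addr):
        # Another cache wrote addr: drop our stale copy (write-invalidate).
        self.lines.pop(addr, None)


class Bus:
    def __init__(self):
        self.memory = {}
        self.caches = []
        self.broadcasts = 0      # every write occupies the shared bus once

    def broadcast_write(self, writer, addr, value):
        self.broadcasts += 1
        self.memory[addr] = value                  # write-through, for simplicity
        for cache in self.caches:
            if cache is not writer:
                cache.snoop(addr)

bus = Bus()
c0, c1 = SnoopyCache("c0", bus), SnoopyCache("c1", bus)
c0.read(0x10)
c1.read(0x10)                  # both caches now hold the line
c1.write(0x10, 42)             # c0 sees the address on the bus and invalidates
```

Every write is seen by every cache, so the `broadcasts` counter grows with total write traffic regardless of how many caches actually share the line. This is the saturation problem described above.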

In  order  to  overcome  the  requirement  of  a  broadcast  medium,  directory  based  protocols 
may  be  used.  Earlier  centralized  directory  based  protocols  maintain  information  about  the 
caches  that  have  copies  of  the  line  in  a  directory  that  is  an  extension  of  the  main  memory. 
This  can  potentially  cause  the  directory  to  become  a  bottleneck.  We  have  developed  a  new 
distributed directory protocol that is based on a singly-linked list of caches. Such a system
is  more  scalable  than  earlier  solutions  in  terms  of  both  cost  and  performance.  This  research 
is  detailed  extensively  in  [Thapar  89a,  4-502],  [Thapar  89b,  4-527]  and  [Thapar  90,  4- 
542]. 
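
The linked-list idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the protocol of the cited reports: memory is assumed to hold only a head pointer per line, each cache holds one forward pointer, a new reader is inserted at the head, and a writer walks the list to invalidate the other sharers. All names are hypothetical.

```python
class Cache:
    def __init__(self, name):
        self.name = name
        self.data = {}   # addr -> value
        self.next = {}   # addr -> next cache in the sharing list (or None)

class Memory:
    def __init__(self):
        self.values = {}
        self.head = {}   # addr -> first cache in the sharing list

    def read(self, cache, addr):
        if addr not in cache.data:
            cache.data[addr] = self.values.get(addr, 0)
            cache.next[addr] = self.head.get(addr)   # link in front of old head
            self.head[addr] = cache
        return cache.data[addr]

    def write(self, cache, addr, value):
        # Walk the list, invalidating every other sharer one hop at a time.
        node, hops = self.head.get(addr), 0
        while node is not None:
            following = node.next.pop(addr, None)
            if node is not cache:
                node.data.pop(addr, None)
            node, hops = following, hops + 1
        self.values[addr] = value
        cache.data[addr] = value
        cache.next[addr] = None
        self.head[addr] = cache                      # writer is now the only sharer
        return hops

def sharers(memory, addr):
    names, node = [], memory.head.get(addr)
    while node is not None:
        names.append(node.name)
        node = node.next.get(addr)
    return names

mem = Memory()
a, b, c = Cache("a"), Cache("b"), Cache("c")
for cache in (a, b, c):
    mem.read(cache, 0x20)
before = sharers(mem, 0x20)      # newest reader is at the head
hops = mem.write(b, 0x20, 7)
after = sharers(mem, 0x20)
```

The directory state at memory is one pointer per line, independent of the number of caches, and invalidation cost is proportional to the number of actual sharers rather than to the size of the machine.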

4.  Problem-Solving Frameworks

One  of  the  key  layers  of  the  vertical  slice  strategy  was  the  organization  of  problem-solving 
activity  according  to  existing  AI  concepts.  AI  provided  us  with  a  number  of  different 
problem-solving  frameworks  as  candidates  for  this  study.  In  fact,  the  project  committed 
itself at an early point to the Blackboard problem-solving model [Engelmore 88]. This was
not  an  entirely  arbitrary  choice.  The  blackboard  metaphor  had  already  been  applied 
successfully  in  the  area  of  real-time  signal  processing  [Nii  82],  the  selected  problem 
domain  for  the  project.  It  was  also  anticipated  that  the  blackboard  metaphor  would  help  us 
to extract parallelism from the application in the way that the problems were formulated,
because the metaphor has a model of asynchrony built into it. For reasons detailed in
[Rice 88a], the blackboard model turned out not to be as parallel as we might have hoped,
but we still know of no better one for concurrent execution.

The  development  of  problem-solving  frameworks  for  parallel  computation  took  two  distinct 
courses.  First  was  the  development  of  a  fairly  conservative,  concurrent  implementation  of 
an  existing  blackboard  system  to  run  on  shared-memory  machines.  This  was  the  Cage 
system, based on AGE [Nii 79], described in Section 4.1. The second course was to
rethink  the  blackboard  metaphor  from  scratch  in  the  hope  of  achieving  really  high 
performance  on  distributed-memory  multiprocessors,  such  as  the  CARE  machine.  This 
resulted  in  the  Poligon  system  described  in  Section  4.2. 

Three  generations  of  papers  have  been  produced  describing  the  strategy  of  the  project,  the 
Cage  and  Poligon  systems  as  they  evolved,  and  the  experimental  results  produced  by  these 
systems.  The  early  motivation  for  the  designs  of  these  systems  is  outlined  in  [Nii  86,  3- 
196],  while  [Nii  88a,  3-205]  and  [Nii  88b,  3-233]  show  the  evolution  of  these  concepts 
and  detail  the  experiments  performed  on  the  two  systems,  dwelling  in  particular  on  the 
factors  that  motivate  and  constrain  the  design  and  performance  of  parallel  systems  in 
general  and  of  parallel  problem-solving  systems  in  particular.  Numerous  lessons  were 
learned  in  the  process  of  this  research,  which  are  listed  in  the  above  reports  and  in  [Rice 
88a,  4-139]  and  [Rice  89b,  4-219]. 

4.1.  Cage 

Cage  (Concurrent  AGE)  [Aiello  86,  2-1],  [Aiello  89, 2-26]  is  a  reimplementation  of  the 
AGE  [Nii  79]  blackboard  system  framework  also  developed  at  the  Heuristic  Programming 
Project  at  Stanford.  The  central  idea  behind  Cage  is  that  the  blackboard  model  provides  a 
certain  amount  of  parallelism  by  its  very  nature.  It  should  therefore  be  possible  to  exploit 
this parallelism without any major redesign or rethinking of the problem-solving model.
Cage is, therefore, an implementation designed to allow the concurrent execution
of  a  blackboard  system  through  the  concurrent  execution  of  the  knowledge  sources  and 
rules  in  the  application  (see  Figure  5).  A  key  factor  in  the  design  of  Cage  was  that  control 
of  which  rules  and  knowledge  sources  were  to  be  run  in  parallel  was  left  entirely  to  the 
user. This allowed the user to develop an application serially, debug it, and then gradually
increase the amount of parallelism exhibited by the application. This made it easy to
identify bugs that were a function of the concurrent execution of small components
of the application. It also allowed the developer to experiment with different configurations
of parallel execution so as to maximize the performance of the application, which might not
be maximized by enabling all possible concurrency options because of contention
problems.

Figure 5. The Cage Architecture. Update events are perceived by the scheduling
component and collected in a global event queue. The scheduler selects the knowledge
sources that are interested in any given event and can execute them in parallel. These
knowledge sources in turn inspect the blackboard and perform updates that are seen by the
scheduler.
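
The control loop in Figure 5 can be sketched in a few lines. This is an illustrative model only: Cage itself was built on QLisp, and the class, function and event names below (including the two example knowledge sources) are hypothetical.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor

class Blackboard:
    def __init__(self):
        self.nodes = {}
        self.events = deque()            # the single global event queue
        self.lock = threading.Lock()     # shared resource: a point of contention

    def update(self, key, value, event):
        with self.lock:
            self.nodes[key] = value
            self.events.append((event, key))

def run(blackboard, knowledge_sources, workers=4):
    """knowledge_sources: dict mapping event name -> list of callables."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            with blackboard.lock:
                batch = list(blackboard.events)
                blackboard.events.clear()
            if not batch:
                break
            # All knowledge sources interested in the pending events run concurrently.
            futures = [pool.submit(ks, blackboard, key)
                       for event, key in batch
                       for ks in knowledge_sources.get(event, [])]
            for f in futures:
                f.result()

def abstract_emitter(bb, key):
    bb.update("emitter:" + key, bb.nodes[key] * 2, "emitter-created")

def cluster_emitter(bb, key):
    bb.update("cluster:" + key, bb.nodes[key] + 1, "cluster-updated")

bb = Blackboard()
for i in range(3):
    bb.update("obs%d" % i, i, "observation")
run(bb, {"observation": [abstract_emitter], "emitter-created": [cluster_emitter]})
```

Every update passes through one lock and one queue, so adding workers eventually stops helping. The sketch makes visible the kind of contention for global shared resources that a centralized scheduler introduces.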

At  the  outset  it  was  not  known  how  difficult  it  would  be  to  program  such  a  system  and  how 
much  performance  could  be  expected,  but  it  was  thought  that  such  an  architecture  might 
well  be  suitable  for  the  current  generation  of  multiprocessors,  which  mostly  have  a  shared- 
memory  design.  Blackboard  systems  are  typically  implemented  using  a  central,  shared 
database  to  represent  the  blackboard.  The  match  between  the  shared  blackboard  and  the 
shared  memory  resource  seemed  to  be  worth  investigating. 

The Cage system was implemented first on a simple emulator, which emulated the
functionality of a QLisp implementation without paying the costs of detailed simulation. It was
later ported to run on the CARE simulator, using QL, an implementation of QLisp and
Multilisp language primitives built on top of the Lamina shared-variable programming
interface [Saraiya 88, 4-324].

The Elint application, described in Section 5.1, was mounted on the Cage system, and
experiments to measure its speed-up and throughput were performed on it. These are detailed in
[Aiello 88, 2-15] and [Rice 89a, 4-198]. The Cage system has shown that blackboard
programs can, indeed, be run in parallel in a relatively simplistic manner. The performance of
Cage, however, is restricted by a number of factors [Nii 88b, 3-233]:

•  its  implementation,  which  was  not  highly  tuned; 

•  its  architecture,  which  exhibits  significant  contention  for  global  shared  resources 
such  as  the  event  queue; 

•  the QLisp substrate, on which it is built; and

•  the  shared-memory  hardware  upon  which  it  runs. 

Thus, although the Cage architecture is a viable architecture for existing shared-memory
hardware  systems,  because  of  the  close  link  between  the  Cage  programming  model  and  its 
underlying  hardware,  we  do  not  anticipate  that  future  concurrent  expert  system  tools  will  be 
built  much  like  Cage.  We  believe  that  the  trend  of  multiprocessor  design  is  broadly  away 
from  shared-memory  machines  and  towards  distributed-memory  designs  because  of  their 
greater  ability  to  scale. 

4.2.  Poligon 

The expectation that the next generation of multiprocessors, for reasons of simplicity,
performance and cost, is likely to consist of distributed-memory machines required a
reexamination of the blackboard model before it could be mounted on such a machine in a manner
likely to deliver good performance. Poligon [Rice 86a, 4-1], [Rice 86b, 4-19], a
domain-independent, blackboard-like programming language and concurrent programming
environment, was developed in an attempt to address these needs. Poligon adopted the
view that processors are going to be cheap and plentiful. Thus, it would be quite
acceptable, if necessary, to allocate one processor or more to each node on the blackboard.

First, the serializing, centralized control mechanism of conventional blackboard systems
was discarded. Distributing the nodes of the blackboard over the processor network
allowed the knowledge base to be spread over the blackboard as well, so as to eliminate any
performance bottleneck due to the communication costs between the knowledge base and
the blackboard. The simplest available rule invocation mechanism was selected, so as to
maximize performance; rules were directly attached to slots of the nodes on the blackboard.
A modification to a slot to which a rule was attached resulted in that rule being invoked.
Rule invocations were spun off into different processes on different processors for
execution, thus minimizing the length of the critical sections on the processors holding
blackboard nodes and allowing multiple, simultaneous rule invocations for the same modified
blackboard object (see Figure 6).

In practice, these mechanisms did indeed result in good performance, but they also resulted
in significant problems. Many uncontrolled asynchronous processes, all reading and
writing things in a shared database, are unlikely to reach a coherent or correct answer.
Extra mechanisms had to be implemented, which allowed the blackboard nodes to have
"goals" and the ability to evaluate their own performance with respect to the overall goal of
the system. This allowed the blackboard nodes to make local decisions about whether to
perform any modification operation attempted by a rule. The result was a sort of
distributed hill-climbing behavior: nodes iterated towards a good solution.
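
A minimal sketch of these two mechanisms, slot-attached rules and local acceptance of updates, is given below. It runs serially for clarity, whereas Poligon spun rule invocations off into separate processes. The `Node` class, the `score` function and the example estimates are hypothetical and stand in for Poligon's actual constructs.

```python
class Node:
    """A blackboard node that accepts an update only if it improves its local goal."""
    def __init__(self, name, score):
        self.name = name
        self.slots = {}
        self.rules = {}      # slot name -> list of rules watching that slot
        self.score = score   # local goal: higher is better

    def attach(self, slot, rule):
        self.rules.setdefault(slot, []).append(rule)

    def propose(self, slot, value):
        current = self.slots.get(slot)
        if current is not None and self.score(value) <= self.score(current):
            return False                     # local decision: reject, no rules fire
        self.slots[slot] = value
        for rule in self.rules.get(slot, []):
            rule(self, value)                # in Poligon these ran as separate processes
        return True

# Hypothetical goal: prefer position estimates with a smaller error bound.
track = Node("track", score=lambda est: -est["error"])
emitter = Node("emitter", score=lambda est: -est["error"])

# A rule attached to the emitter's slot forwards accepted estimates to the track node.
emitter.attach("position", lambda node, est: track.propose("position", est))

results = [emitter.propose("position", {"x": 10, "error": 5.0}),
           emitter.propose("position", {"x": 12, "error": 8.0}),   # worse: rejected
           emitter.propose("position", {"x": 11, "error": 2.0})]
```

Because each node only ever moves to a state it judges better, the order in which asynchronous rule firings arrive matters less: a late, inferior update is simply discarded.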

Figure 6. The Poligon Architecture. Updates on the blackboard are observed by
rules which watch specific slots of blackboard nodes. These rules can fire in parallel,
causing further updates to the same or other nodes. This flow of updates from one node to
another implicitly forms pipes, which increase the parallelism realizable by the system.


These mechanisms did not come without associated costs in terms of granularity. Although
the Poligon system delivers very high performance when compared to other blackboard
systems such as AGE, it nevertheless falls significantly short of the performance provided by an
application written directly in Lamina. However, an appropriate conceptualization and
decomposition of a problem is the most difficult task for a programmer, and the most
critical for obtaining speed-up. Poligon is a relatively high-level language compared to
Lamina, and as such gives the programmer an edge in conceptualization. Poligon,
therefore, provides a fairly general concurrent implementation of the blackboard
problem-solving model with all of the advantages of abstraction and modularity that this confers. It
does so, however, at a price. A detailed rationale and description of Poligon's design and
implementation  can  be  found  in  [Rice  89b,  4-219].  This  paper,  through  a  detailed 
discussion  of  the  factors  that  limit  the  performance  of  blackboard  systems  in  general  and 
concurrent  blackboard  systems  in  particular,  shows  the  motivation  for  the  design  of 
different  aspects  of  Poligon,  detailing  the  evolution  of  numerous  different  aspects  of  the 
Poligon  system,  and  highlighting  the  deficiencies  of  each  design  that  was  attempted  and 
then  superseded.  It  also  describes  a  number  of  means  by  which  the  performance  of 
Poligon  could  be  improved  by  superior  compilation  if  it  were  to  be  turned  into  a  production 
quality  system. 


The Elint application, described in Section 5.1, was implemented in the Cage, Poligon and
Lamina systems. The results of these experiments are reported in [Rice 88b, 4-165], [Rice
89a, 4-198] and [Nii 88b, 3-233]. These reports also describe both the motivation for and
architecture of Poligon as well as highlighting numerous experimental results, which are
analyzed with a view to the lessons that can be learned from Poligon's performance. In
[Nii 89, 3-298] a discussion is given on the way in which the serial Elint application was
recoded so as to run on the concurrent Poligon framework. This has numerous
implications for the development of concurrent real-time signal understanding problems.


Another application called ParAble, implemented using the Poligon framework, is
described in Section 5.3.

5.  Applications 

Our  research  strategy  called  for  the  project  work  to  be  application  driven.  The  search  for  an 
application was guided primarily by two considerations: (a) practical versions of the
application  would  demand  significant  speed-up  in  execution,  and  (b)  methods  for 
approaching  the  application  held  a  certain  obvious  potential  for  concurrent  execution. 
Based  on  these  considerations,  the  application  area  chosen  was  real-time  signal 
understanding.  Existing  blackboard  systems,  such  as  HASP/SIAP  [Nii  82]  and  Tricero 
[Williams  84]  had  shown  both  that  the  blackboard  problem-solving  model  was  appropriate 
for  this  domain  and  that  the  performance  deliverable  using  existing  blackboard  tools  was 
inadequate  to  field  such  systems. 

What we needed, therefore, was a problem that was complex enough to give us a
reasonable model of a real system, and yet simple enough that we would not spend too much
effort on the mechanics of its implementation. We decided initially to focus on a problem
called Elint, a system for the understanding of passive radar signals. This application,
derived from Tricero, is described in Section 5.1.

After  much  experimentation  it  was  determined  that  our  ability  to  exploit  parallelism  was 
being constrained by the problem we were using: it was not sufficiently complex in terms
of  the  amount  of  knowledge  and  the  amount  of  data  available.  In  the  search  for  a  more 
knowledge-rich  and  computationally  intensive  application  we  developed  the  AirTrac 
application,  a  system  for  interpreting  active  radar  signals,  which  is  described  in  Section 
5.2. 

Experiments  were  also  performed  in  application  domains  other  than  that  of  real-time  signal 
understanding; ParAble, a system for fault-finding in particle accelerator beam lines, has
been developed using the Poligon framework. This work is described in Section 5.3. A
number of numerical or semi-numerical programs have also been developed during our
more hardware-related experiments. These investigations are mentioned in Section 5.4.


5.1.  Elint

Figure 7. The Elint Application. Sensor data is abstracted into hypothetical radar
emitters, which are tracked as clusters of emitters.

Elint is a soft real-time system for the interpretation of passive radar signals. Data are
collected from a number of receiving stations and are integrated so as to allow the system to
track radar-emitting aircraft as they pass through the monitored airspace. The data are
abstracted into hypothetical radar-emitting platforms. These emitters are in turn collected into
clusters of emitters, which might represent a number of planes or a single plane using
multiple radar systems, as is often the case with modern military aircraft (see Figure 7).

Elint was first implemented using the CAOS system. It was originally thought that this
work would take only a couple of months to do. In fact, the complete task (implementation,
experimentation and analysis of results) took 18 months. We learned early on that
it is by no means a trivial matter to reimplement an existing, serial application in a parallel
environment.  These  initial  experiments,  which  are  detailed  in  [Brown  86, 2-78],  delivered 
both  qualitative  and  quantitative  results  concerning  the  performance  of  a  concurrent  system 
such  as  we  were  envisaging,  over  a  variety  of  different  numbers  of  processors,  and 
investigated  such  critical  areas  as  overall  speed-up  and  "solution  quality."  The  concept  of 
solution  quality  arises  in  many  knowledge-based  systems,  where  there  is  no  such  thing  as 
the  correct  problem  solution,  but  only  satisficing  (i.e.,  acceptable)  problem  solutions.  A 
primary  objective  of  the  experiments  was  to  investigate  the  trade-offs  between  the 
imposition  of  various  synchronizations  (and  the  resulting  loss  of  concurrency)  and  the 
quality  of  the  problem  solution. 

Since the CAOS implementation, Elint has been implemented three times, using Lamina
[Delagi 88b, 2-446], [Saraiya 89, 4-337] and the Cage [Aiello 88, 2-15] and Poligon [Rice
88b, 4-165], [Nii 88b, 3-233] frameworks, and a number of experiments have been
performed on these implementations. In order to perform any of these experiments we found
it necessary to develop a technique for performance measurement that actually measured the
sustainable data rate that the system under experimentation could maintain for a given
number of processors without being swamped by the incoming data, i.e., while still giving
non-increasing latency in its outputs. This technique is discussed in detail in [Nii 88b, 3-233].
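
The measurement idea can be illustrated with a toy queueing model. The sketch assumes identical processors, a fixed service time per data item and a fixed arrival interval. These simplifications, and all function names, are assumptions for illustration and do not reproduce the technique of [Nii 88b].

```python
import heapq

def latencies(arrival_interval, service_time, processors, items=200):
    """Simulate items arriving at a fixed interval, served by identical processors."""
    free_at = [0.0] * processors          # min-heap of times at which processors become free
    result = []
    for i in range(items):
        arrival = i * arrival_interval
        start = max(arrival, heapq.heappop(free_at))
        finish = start + service_time
        heapq.heappush(free_at, finish)
        result.append(finish - arrival)
    return result

def is_sustainable(lat, tolerance=1e-9):
    """Latency is stable if the second half of the run is no slower than the first."""
    half = len(lat) // 2
    return max(lat[half:]) <= max(lat[:half]) + tolerance

def sustainable_interval(service_time, processors, candidates):
    """Return the smallest arrival interval (highest data rate) with stable latency."""
    stable = [c for c in candidates
              if is_sustainable(latencies(c, service_time, processors))]
    return min(stable) if stable else None

best = sustainable_interval(service_time=4.0, processors=4,
                            candidates=[0.5, 0.8, 1.0, 1.25, 2.0])
```

In this toy setting four processors with a service time of 4 can absorb one item per time unit. Any faster input makes the backlog, and hence the latency, grow without bound, which is the "swamped" condition the measurement is designed to detect.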

Each  of  these  reports  details  not  only  the  underlying  architecture  of  the  solution,  for 
example  an  object-based,  pipelined  decomposition  in  the  case  of  the  Lamina  experiments, 
but  also  covers  extra  areas  for  experimentation  appropriate  to  the  framework  being  studied 
and  the  intended  level  of  abstraction  of  the  framework.  These  areas  included:  multiple 
grain  sizes  [Nii  88b,  3-233],  [Aiello  88,  2-15],  speed-up  as  a  function  of  only  pipeline 
parallelism  [Nii  88b],  [Rice  88b,  4-165],  [Rice  89a,  4-198],  scaling  with  respect  to 
knowledge  base  size  (number  of  rules)  [Nii  88b],  [Rice  88b],  [Rice  89a]  and  load 
balancing  [Saraiya  89, 4-337]. 

5.2.  AirTrac 

The  development  of  the  Elint  application  showed  that  the  amount  of  parallelism  that  could 
be  demonstrated  was  much  more  dependent  on  the  application  than  we  had  anticipated. 
Following  the  analysis  of  Reddy  and  Newell  [Reddy  77],  we  hypothesized  that  by 
extracting  parallelism  at  the  different  levels  of  the  system's  implementation  hierarchy  we 
could  gain  multiplicative  speed-up.  The  analysis  of  our  experiments  showed  that  the 
speed-up  was  disappointing,  largely  because  the  application  itself  did  not  have  enough 
potential  for  the  exploitation  of  parallelism. 

Our response to this was to develop an application that would really stretch the hardware
and software we were developing in a realistic manner: the AirTrac application [Delaney
86, 2-459].

The  AirTrac  problem  domain  sounds  superficially  like  that  of  Elint.  It  was  a  system  for  the 
interpretation  of  radar  data,  though  in  this  case  the  radar  systems  modeled  were  active,  not 
passive.  Unlike  Elint,  AirTrac  was  designed  to  go  beyond  simply  tracking  aircraft  and 
identifying  likely  threats.  The  scenario  for  AirTrac  was  the  detection  of  "smugglers"  flying 
across  a  border.  The  problem  faced  by  existing  radar  users  is  that  a  large  number  of 
legitimate  aircraft  travel  in  the  same  airspace  as  smugglers.  Smugglers  may  take  advantage 
of  variations  in  terrain  in  order  to  find  areas  of  poor  or  no  radar  reception.  They  also  resort 
to  other  evasive  tactics.  Thus  to  identify  and  track  smugglers,  the  AirTrac  application  had 
to  interpret  the  behavior  of  the  aircraft  it  was  tracking  over  time. 

The  system  was  designed  in  a  number  of  layers  so  that  different  implementation  efforts 
could  be  decoupled.  The  first  subsystem  implemented  was  called  the  Data  Association 
component [Nakano 87, 3-149], and is the subsystem that most closely matches the
Elint  application.  It  was  initially  intended  that  this  component  would  be  implemented  using 
the  Poligon  framework.  It  was  found,  however,  that  the  simulation  of  the  Poligon  system 
for  a  problem  as  complex  as  AirTrac  would  take  prohibitively  long.  Consequently  AirTrac 
was  implemented  directly  in  Lamina.  Substantial  speed-up  was  shown  (of  the  order  of  one 
hundredfold  with  the  use  of  one  hundred  processors),  which  seemed  to  increase  linearly 
with  the  number  of  processors.  This  encouraging  result  was  achieved  by  the  use  of 
replicated pipelined sequences of objects processing the input data. It was further found
that  the  degree  of  correctness  of  the  solution  was  not  compromised  by  the  decomposition  of 
the  problem  so  as  to  make  it  execute  concurrently,  nor  was  it  affected  by  highly  overloaded 
input  data  conditions  [Nakano  87]. 

The  second  component  of  AirTrac,  Path  Association,  was  significantly  more  knowledge 
intensive  than  the  first.  The  task  of  the  Path  Association  module  was  to  group  together 
tracks  produced  by  the  Data  Association  component  into  plausible  aircraft  flight  paths. 
This  subsystem  was  also  implemented  directly  in  Lamina  initially.  However,  programming 
in the raw Lamina framework was too complex and time-consuming, so a layer was built
on  top  of  Lamina,  called  ELMA  [Noble  88b,  3-409],  which  provided  the  abstraction  model 
needed  for  the  implementation.  The  experiments  described  in  [Noble  88a,  3-309]  provided 
confirmation  of  the  earlier  results  obtained  with  the  Elint  application. 

The  project  leaders  decided  to  continue  to  focus  resources  on  the  Path  Association 
component,  where  there  was  still  much  to  learn  and  where  we  believed  further  speed-up 
and  insight  could  be  obtained.  In  [Muliawan  89,  3-48],  further  experiments  in  the  AirTrac 
application's  Path  Association  component  are  described.  The  effect  of  high-level  control 
strategies  on  system  performance  is  discussed,  as  is  the  effect  of  varying  the  frequency  and 
width  of  the  input  data,  for  various  numbers  of  processors.  System  performance  was 
measured  both  in  terms  of  sustainable  data  rate  and  in  terms  of  latency,  "excess  ratio"  and 
capacity.  The  relationship  between  the  quantitative  and  qualitative  performance  of  the 
system  is  also  discussed. 

The  final,  most  abstract,  component  of  AirTrac  —  Platform  Interpretation  —  was  intended 
to  classify  the  aircraft  being  tracked  by  the  Data  Association  and  Path  Association  modules 
and  to  predict  their  behavior,  based  on  these  classifications  and  their  past  actions.  A 
platform  classification  module  was  implemented,  using  a  general,  forward-chaining, 
concurrent  classification  system  [Clancey  84].  These  experiments  demonstrated  speedup 
and are described in [Maegawa 90, 3-20]. The key idea was to view the classification
system  as  a  network  of  nodes  representing  classifications  and  subclassifications.  Speedup 
was  achieved  through  the  concurrent  execution  of  multiple  instances  of  the  classification 
network.  Because  the  input  track  information  was  continuously  acquired  over  time,  the 
system  necessarily  supports  periodic  reevaluation  of  all  classifications.  That  is,  all 
conclusions  drawn  by  the  system  may  be  continuously  modified  as  new  supporting 
evidence  enters  the  network. 

5.3.  ParAble

The ParAble project [Bandini 89] was an attempt, by choosing a completely different
application domain, to test the generality of the problem-solving model offered by Poligon. To
do this we made a parallel implementation of the ABLE system [Selig 87], also developed
at Stanford.

The  objective  of  the  ABLE  project  was  to  find  a  rational  and  fast  way  to  diagnose  particle 
accelerator  beamlines.  These  large  and  complex  machines  are  very  prone  to  beam 
alignment  problems  due  either  to  misalignment  of  the  magnets,  which  steer  and  focus  the 
beam,  or  to  problems  with  the  power  supplies  to  those  magnets,  which  result  in  the 
magnets  not  having  the  desired  strength.  These  beamlines  are  so  complex  that  it  can  take 
many  months  of  knob-twiddling  in  order  to  commission  them. 

By  the  use  of  an  analytic  model  of  the  transfer  functions  of  the  beam-line  components,  and 
a  number  of  heuristics  that  use  successive  runs  of  the  model,  comparing  the  results  with  the 
real  data  to  locate  the  faults,  the  ABLE  system  was  able  to  find  faults  in  such  systems  in 
about ten minutes. As particle accelerators become more complex there may well be a need
to  control  them  in  real  time,  so  although  there  is  no  immediate  need  for  higher  performance 
in  the  ABLE  system,  it  is  not  unreasonable  to  suppose  that  there  might  be  in  the  future. 

A number of experiments have been performed on ParAble, detailed in [Bandini 89, 2-58].
The  realizable  parallelism  in  this  project  was,  again,  found  to  be  limited  mostly  by  the 
availability  of  data  parallelism. 

5.4.  Numerical and Semi-numerical Programs

The expert systems mentioned above are not ideal applications for multiprocessor
execution. They are irregular and very data dependent. A large body of applications already
exists in the area of numerical and semi-numerical processing, which will require the
speed-up associated with parallel execution. Indeed, such programs are already being run on a
number  of  multiprocessors.  It  is  therefore  essential  that  any  machine  designed  with  a  view 
to  being  general-purpose  must  also  be  able  to  execute  these  regular,  algorithmic  problems 
efficiently.  A  number  of  small  numerical  programs  were  developed,  to  be  used  for 
experiments  in  system  architectures  and  topologies  [Byrd  88a,  2-181]  and  [Byrd  88b,  2- 
196].  These  experiments  allowed  us  to  test  our  hardware  and  software  ideas  in  a  much 
more  controllable  way  than  we  can  with  any  expert  system  application. 

6.  Conclusions, Observations, Results

The  previous  sections  summarized  the  experiments  performed,  the  types  of  computations 
explored,  the  simulation  engines  built  to  conduct  the  experiments,  and  some  experimental 
results.  We  tied  each  of  these  to  specific  technical  reports  of  the  project.  In  this  section, 
we  add  conclusions,  results,  and  observations  of  a  general  nature.  These  have  been  drawn 
from  across  the  range  of  experiments  performed,  and  we  believe  will  be  of  interest  to  a 
large  body  of  computer  scientists  interested  in  the  problems  of  parallel  computation. 

We  begin  with  words  of  caution.  Our  experiments  were  performed  mainly  in  the  area  of 
symbolic problem solving by computer, that is, the traditional mainstream area of artificial
intelligence research. The kinds of entity typically manipulated were symbolic objects and
rules, not algebraic formulas or matrices of numbers. The computations were largely
symbolic computations (as, for example, typically performed in the LISP language).

Low-level  representational  choices,  constituting  the  focus  of  our  experiments,  and  therefore 
influencing  our  conclusions,  include  object-orientation  with  message  passing,  on  a  MIMD- 
type  machine.  Most  experiments  were  performed  using  distributed-memory  system 
architecture.  One  final  caveat:  all  experiments  were  performed  on  our  instrumented 
simulator.  Though  we  are  confident  of  the  quality  and  veracity  of  the  simulated 
computations,  a  simulator  is  only  a  model  of  reality. 

Finally, one must always keep in mind the simple algebraic relationship often called
"Amdahl's law." The ultimate limit to speedup of computation on a parallel machine, the
"Amdahl limit," is determined by the residual amount of "serial processing" remaining in the
computation after the programmer has extracted and used all the parallel computation
schemes possible. Thus, for example, if the intrinsic serial component of the computation
is no less than 1%, then the overall speedup cannot exceed two orders of magnitude
(x100).
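
The relationship can be made concrete with a short calculation. The sketch uses the standard form of Amdahl's law, speedup = 1 / (s + (1 - s) / N) for serial fraction s on N processors; the function names are illustrative.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup of a computation whose serial part cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

def amdahl_limit(serial_fraction):
    """Ceiling on speedup as the number of processors grows without bound."""
    return 1.0 / serial_fraction

on_100 = amdahl_speedup(0.01, 100)      # about 50: half the ceiling on 100 processors
on_256 = amdahl_speedup(0.01, 256)      # about 72 on 256 processors
ceiling = amdahl_limit(0.01)            # 100, i.e. two orders of magnitude
```

Note that with a 1% serial component the x100 ceiling is approached only asymptotically: one hundred processors deliver roughly half of it.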

To  repeat,  in  reading  what  follows,  the  reader  should  have  in  mind  the  general  picture  of  a 
two-dimensional  network  of  LISP  computers  (each  with  a  communications  subprocessor) 
of  size  NxN  (typically  10x10  or  16x16).  These  processors  are  receiving,  as  input,  streams 
of  encoded  sensor  data,  and,  with  some  latency  (e.g.,  milliseconds  or  seconds  of  sensor 
time),  computing  hypotheses  of  platform  track  segments,  platform  identity,  etc. 
Computational  work  is  distributed  over  the  multiprocessor  but  many  of  the  nodes  of  the 
NxN  network  are  not  necessarily  busy  all  the  time. 




6.1.  Speedup  over  serial  computation 

An early result by Gupta [Gupta 86] for parallel rule-based computation indicated that a
speedup of approximately one order of magnitude (OM = x10) was achievable. Our
experiments  confirmed  that  speedups  of  approximately  one  OM  were  readily  achieved,  but 
not  without  significant  programming  work  and  ingenuity. 

Speedups  of  2  OM  were  very  difficult  to  achieve  for  individual  problem  solving  efforts  but 
were  achievable  for  groups  of  these  efforts  (e.g.,  one  aircraft  versus  many  aircraft).  We 
refer  to  such  application  circumstances  as  being  characterized  by  “data  parallelism.”  In 
data-parallel  situations  (which  may  be  quite  normal  in  the  world  of  computing 
applications),  the  overall  intrinsic  parallelism  can  be  made  sufficiently  high  relative  to  the 
intrinsic serialization that 2 OM (x100) is achievable.

Speedups of 3 OM (x1000) were well beyond the reach of any techniques, or any problem
size,  explored  in  this  study. 

Because  of  the  limits  imposed  by  the  inherent  serialization,  even  when  the  application  is 
augmented  by  favorable  data  parallelism,  speedup  will  reach  a  ceiling,  beyond  which  one 
cannot  push  the  speedup  by  simply  increasing  the  number  of  computing  nodes. 

When  working  in  a  problem  environment  in  which  data  enter  in  continuous  streams, 
determining  how  to  measure  effective  speedup  is  an  important  issue.  Our  observation  is 
that  stable  latency  in  delivering  a  computational  hypothesis  (i.e.,  answer),  after  the 
corresponding  data  are  presented  in  the  input  stream,  is  the  appropriate  measure.  Thus,  for 
each  multiprocessor  configuration  many  input  data  rates  must  be  tried  before  a  stable 
latency  can  be  found.  The  determination  of  speedup  is  thus  a  lengthy  process.  This 
technique  is  in  contrast  to  the  more  common  method  of  simply  dividing  the  application's 
run-time  into  the  run-time  on  a  single  processor.  We  found  this  latter  method  to  be  highly 
deceptive  for  real-time  and  data-reactive  applications. 
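
The rate-sweeping procedure can be sketched as follows. This is a minimal illustration, not the project's measurement code: a single fixed-rate FIFO server stands in for a full simulator run, and the stability test (comparing mean latency late in the run against mean latency earlier in the run) and its tolerance are assumptions chosen for simplicity.

```python
def simulate_latencies(arrival_interval, service_time, n_items):
    """Latency of each item through one FIFO server with fixed timing.

    This toy server is an assumed stand-in for an instrumented simulator run.
    """
    latencies = []
    server_free_at = 0.0
    for i in range(n_items):
        arrival = i * arrival_interval
        start = max(arrival, server_free_at)
        server_free_at = start + service_time
        latencies.append(server_free_at - arrival)
    return latencies


def is_stable(latencies, tolerance=1.05):
    """Treat latency as stable if it is not drifting upward over the run."""
    quarter = len(latencies) // 4
    if quarter == 0:
        raise ValueError("need at least four latency samples")
    early = sum(latencies[quarter:2 * quarter]) / quarter
    late = sum(latencies[-quarter:]) / quarter
    return late <= tolerance * early


def max_stable_rate(service_time, candidate_rates, n_items=400):
    """Highest candidate input rate whose latency stays stable, or None."""
    stable = [
        rate for rate in candidate_rates
        if is_stable(simulate_latencies(1.0 / rate, service_time, n_items))
    ]
    return max(stable) if stable else None


if __name__ == "__main__":
    print(max_stable_rate(0.1, [2, 5, 8, 12, 20]))
```

In this toy model the server can absorb 10 items per unit time, so input rates of 12 and 20 produce steadily growing latency and are rejected. Each candidate rate requires a complete run, which is why the procedure is lengthy in practice.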

Our observation is that the most significant sources of “parallelization” of a problem are to
be found in the application itself, and must therefore be found by the application programmer. Of course,
system-level  language  and  architectural  features  must  be  there  to  support  this  human 
programmer  creativity.  Because  the  programmer  plays  such  a  vital  role  in  realizing  the 
speedup  from  parallelization,  programmer  language  tools  for  conceptualizing  and  writing 
concurrent  programs  are  of  great  importance.  Equally  important  are  well-instrumented 
debugging  tools  to  aid  the  programmer  since  the  complexity  of  parallel,  run-time 
environments  is  well  beyond  anything  that  even  the  best  programmers  have  been  trained  to 
cope  with. 

We cannot emphasize too much the importance of a variety of software instruments in
tuning parallel computations. These must be provided, either by the manufacturer (in the
form of software instruments responding to hardware measurements) or by a fully
instrumented, carefully designed simulator of the parallel hardware. Today, neither of
these is routinely done. We found the feedback provided by the instrumentation in our
simulator  essential  in  refining  designs:  to  break  bottlenecks,  balance  pipelines,  evaluate 
load  balancing  schemes,  and  so  forth. 

Careful  thought  must  be  given  to  instrumentation  at  the  application  level  (not  just  the 
machine  level)  because  execution  behavior  is  just  too  complex  for  programmers  to  think 
through.  Experimentation  is  needed  to  decide  how  to  instrument  at  the  application  level. 
These  decisions  are  not  clear  and  straightforward. 

6.2. Pipelines

The  architectural  approach  to  concurrency  that  most  consistently  proved  effective  in  our 
experiments  was  building  pipelines  for  computations,  and  replicating  them  when 
necessary. 

•  For  our  data-streaming  application  environment,  pipelines  were  a  natural  fit  for 
exploiting  the  intrinsic  parallelism  and  the  data  parallelism  in  the  problem.  (By 
intrinsic  parallelism  we  mean  that  the  stages  in  the  pipeline  correspond  in  a  natural 
way  to  steps  in  the  problem  solving  process:  inference  steps;  subproblem  pipes; 
multiple  hypotheses  and  goals,  etc.) 

•  Pipelines  must  be  carefully  balanced  during  execution.  For  example,  if  a  pipeline 
consists  of  a  series  of  invoked  knowledge  sources,  these  must  have  similar  “grain 
size”  and  data  density. 

•  Some  computations  branch  in  a  fan-out  manner.  The  pipeline  approach  to  fan-out 
computations  can  be  made  to  yield  good  speedup. 

•  Conversely,  fan-in  of  computations  is  disruptive  of  pipelining,  and  seriously 
impacts  the  speedup  that  can  be  realized.  A  particular  type  of  fan-in  occurs  when 
symbols  are  passed  “up”  an  abstraction  hierarchy,  e.g.,  when  special  cases  are 
recognized  as  instances  of  a  more  general  case.  In  an  abstraction  hierarchy 
pipeline,  it  is  important  that  the  communication  up  the  hierarchy  should  be  designed 
to  decrease,  proportionately  to  the  amount  of  “branchiness”  at  the  lower  levels,  as 
the  symbolic  data  flows  “up”  the  hierarchy. 
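
The effect of pipeline balance and replication can be illustrated with an idealized throughput model. The sketch below assumes fixed per-stage processing times, no communication cost, and fully independent pipeline copies; these are simplifying assumptions, and the function names are hypothetical.

```python
def pipeline_throughput(stage_times, copies=1):
    """Items completed per unit time once the pipeline is full.

    The slowest stage sets the pace; independent copies add linearly.
    """
    if not stage_times or min(stage_times) <= 0:
        raise ValueError("stage_times must be positive and non-empty")
    return copies / max(stage_times)


def stage_utilization(stage_times):
    """Fraction of time each stage is busy when paced by the slowest stage."""
    slowest = max(stage_times)
    return [t / slowest for t in stage_times]


if __name__ == "__main__":
    unbalanced = [1.0, 4.0, 1.0]   # same total work as the balanced case
    balanced = [2.0, 2.0, 2.0]
    print(pipeline_throughput(unbalanced), stage_utilization(unbalanced))
    print(pipeline_throughput(balanced), stage_utilization(balanced))
    print(pipeline_throughput(balanced, copies=4))
```

Both pipelines perform six units of work per item, but the unbalanced one delivers half the throughput and leaves two of its three stages idle three quarters of the time. This is the sense in which stages need similar grain size.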

Our  experiments  with  pipelines  showed  that  resumable  processes  are  rarely  needed  (hence, 
the architecture need not treat this issue as one of high priority). Most computational grains
can  be  realized  in  such  a  way  that  they,  and  therefore  the  processes  in  which  they  run,  can 
run  to  completion.  As  a  corollary,  our  experiments  showed  that  the  significant  costs 
associated  with  process  instantiation  and  process  switching  can  often  be  avoided  by  the  use 
of  this  run-to-completion  programming  model. 

6.3.  Basic  computational  metaphor 

The object-oriented message-passing paradigm was found to be a natural and comfortable
metaphor  in  conceptualizing  concurrent  programming  and  in  thinking  through  the  issues  of 
distributed  memory  and  communications.  This  model  was  found  to  be  highly  compatible 
with  the  underlying  message-passing,  distributed-memory  system  architecture  used  in  our 
experiments.  It  is  not  clear  that  all  object-oriented  models  will  have  this  property.  For 
example,  multimethods  in  CLOS  may  be  incompatible  with  distributed  memory 
architectures. 

6.4.  Communication 

Our  experiments  showed  that  communication  patterns  among  processes  were  surprisingly 
static.  The  implication  for  interprocess  communication  is  to  prefer  “streams”  to  “futures,” 
i.e.  to  amortize  the  cost  of  initiating  communication  between  processes  by  maintaining 
connections  and  passing  more  than  one  value  between  connected  processes. 
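
The amortization argument can be expressed as a simple cost model. The sketch assumes that a future-style exchange pays the connection setup cost for every value, while a stream pays it once; the cost units and function names are hypothetical and are not measurements from the experiments.

```python
def futures_cost(n_values, setup_cost, transfer_cost):
    """Total cost when every value pays for its own connection setup."""
    return n_values * (setup_cost + transfer_cost)


def stream_cost(n_values, setup_cost, transfer_cost):
    """Total cost when one connection is set up and then reused."""
    if n_values == 0:
        return 0.0
    return setup_cost + n_values * transfer_cost


if __name__ == "__main__":
    n, setup, transfer = 100, 10.0, 1.0
    print(futures_cost(n, setup, transfer))   # setup paid 100 times
    print(stream_cost(n, setup, transfer))    # setup paid once
```

The advantage of the stream grows with the number of values exchanged over the same connection, which is why static communication patterns favor it.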

An  architecture  for  hardware-supported  multicast  was  designed  that  provided  for  adaptive 
cut-through routing. Our experiments proved its effectiveness for deadlock avoidance.
The  scheme  provides  cut-through  multicast  without  requiring  dedicated  storage  in  the 
communication  facilities  for  a  full  packet. 




6.5.  Problem  Solving  Methods 

A relatively straightforward “parallelization” of AGE, a blackboard problem solving
framework that is effective in serial environments, proved to be ineffective: the parallel
version, Cage (Concurrent AGE), did not deliver much speedup. The key issue was
centralized control, and the consequent low “Amdahl limit,” mentioned earlier, that
resulted from the synchronization at this central control point.

A radically reorganized blackboard framework (Poligon), using decentralized control,
enabled significant speedup.

•  The  Poligon  experiments  showed  that  although  problems  can  be  solved  without 
global  control,  some  limited,  non-local  control  was  needed  (for  example,  to  manage 
node  creation). 

•  For  the  minimal  control  regime,  each  solution  node  should  have  its  own  goals  and 
evaluation functions (to enable local “hill-climbing”). The use of local
“hill-climbing” resulted, in our experiments, in globally valid, mutually consistent
results, in spite of the lack of global coordination. It was observed that this local
hill-climbing reduces the latency in getting a plausible answer (i.e., performance
improves).

We must caution, however, that not all problems can be solved (effectively or at all) with a
control  regime  that  enforces  this  “local”  view.  Here,  as  before,  the  choice  of  approach  is 
application-dependent,  or  at  least  dependent  on  the  way  a  problem  is  formulated.  One  must 
consider  the  tradeoff  between  global  control  versus  local  knowledge  (in  the  form  of 
evaluation  functions  and  local  control). 

Concerning  rule  processing  in  problem  solving,  our  experiments  showed  that  the  rules 
within  sets  of  rules  (i.e.,  knowledge  sources)  can  be  run  in  parallel,  and  that  this 
contributes  to  speedup.  However,  running  rules  in  parallel  requires  encapsulation  of  the 
data  used  by  the  rules  (i.e.,  the  rule  context).  Because  the  context  can  become  obsolete  by 
the  time  the  rule  is  processed,  this  technique  needs  to  be  used  in  conjunction  with  local  hill 
climbing  to  prevent  outdated  and  invalid  conclusions  from  being  recorded. 
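
One way to picture how a local evaluation function screens out conclusions derived from an obsolete context is sketched below. The class, its fields, and the scoring function are hypothetical illustrations; they are not taken from Poligon or from any other system in this project.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SolutionNode:
    """A solution node that carries its own evaluation function."""
    value: float
    evaluate: Callable[[float], float]  # higher score means a better value

    def propose(self, candidate: float) -> bool:
        """Record the candidate only if it improves on the current value."""
        if self.evaluate(candidate) > self.evaluate(self.value):
            self.value = candidate
            return True
        return False


if __name__ == "__main__":
    def closeness(x):
        # Hypothetical local goal: get as close as possible to 10.
        return -abs(x - 10.0)

    node = SolutionNode(value=0.0, evaluate=closeness)
    # Two rules ran in parallel from the same snapshot of the node.
    print(node.propose(9.0))   # accepted: improves on 0.0
    print(node.propose(4.0))   # rejected: its context is now outdated
    print(node.value)
```

Whichever of the two proposals arrives first, the node ends at the better value, because each write is checked against the node's current state rather than against the snapshot the rule originally saw.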

We  observe  that  the  quality  of  a  solution  is  an  issue  in  parallel  problem  solving.  AI 
problem solving methods usually “satisfice,” i.e., there is no one “right answer.” But in
real applications some answers are better than others. With decentralized control, it can be
difficult for the programmer to orient the program’s behavior always in the direction of the
“better” answers. Here is an obvious tradeoff: the more centralized the control, the more
programmer guidance toward the “better” answers, the less the speedup (too much
centralized control implies synchronization, i.e., a low Amdahl limit). In our experiments we were
able  to  preserve  solution  quality  by  keeping  data  consistent  and  controlling  order-critical 
tasks.  However,  we  did  not  extract  a  general  technique  or  even  a  general  engineering 
understanding  of  this  fascinating  issue. 

A  related  issue  is  the  tradeoff  between  problem  decomposition  and  degree  of  coupling 
among  the  decomposed  subproblems.  We  observed  that  as  problems  are  decomposed  into 
smaller grains, the subproblems become more interdependent (more data sharing, more
communication),  nullifying  the  potential  parallelism.  Thus,  optimal  grain  size  is  highly 
dependent  on  the  application  as  well  as  on  the  processing  overhead. 

6.6. Development strategy

We  observed  a  three-part  strategy  to  be  important. 

a.  Do  a  serial  execution  of  the  parallel  program.  This  removes  the  obvious  elementary 
bugs. 

b.  Proceed  to  a  relatively  crude  parallel  simulation,  using  a  purely  functional  simulator 
with  randomized  scheduling  of  resources.  This  will  catch  the  first  level  of  “parallel 
processing”  bugs. 

c.  Proceed  to  the  fine-grained  instrumented  simulator  for  the  experiments  and  the 
performance  tuning. 
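
The value of randomized scheduling in step (b) can be illustrated with a very small functional simulator. The sketch below is an assumption about the general idea, not a description of the project's simulator: it runs the same set of tasks in several random orders and reports whether the final state depends on the order.

```python
import random


def run_randomized(tasks, seed):
    """Run every task once, in an order chosen by a seeded random shuffle."""
    rng = random.Random(seed)
    pending = list(tasks)
    rng.shuffle(pending)
    state = {}
    for task in pending:
        task(state)
    return state


def order_sensitive(tasks, seeds):
    """True if any two schedules leave the shared state different."""
    results = [run_randomized(tasks, seed) for seed in seeds]
    return any(result != results[0] for result in results[1:])


if __name__ == "__main__":
    def set_x(state):
        state["x"] = 1

    def double_x(state):
        # Silently assumes set_x has already run: an ordering bug.
        state["x"] = state.get("x", 0) * 2

    print(order_sensitive([set_x, double_x], range(50)))
```

Because each schedule is determined by its seed, a schedule that exposes a bug can be replayed exactly while the bug is being fixed.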

6.7.  Analysis  of  the  application 

Not  enough  can  be  said  in  motivating  a  careful  study  of  the  application  to  understand  its 
intrinsic parallelism. Previously employed serial processing schemes used to handle the
application  can  be  seriously  misleading  and  ineffective. 

In one of our experiments (ParAble), the domain expert, an accelerator physicist, was able to
reconceptualize the computation in a way that was not only “highly parallel” (enabling a
good experimental result for us) but also “highly natural,” enabling him to understand his
problem  with  great  clarity. 

Applications  analysts  should  not  try  to  “force  fit”  their  application  to  the  parallel  computing 
metaphor,  but  rather  should  seek  a  natural,  intrinsic  parallel  structure  of  the  computation. 

6.8.  Load  balancing 

Balancing  the  computational  load  among  the  nodes  of  a  multicomputer  is  a  serious  issue  of 
performance  and  economics.  It  takes  both  computing  and  communication  to  perform 
effective  load  balancing,  so  an  obvious  performance  tradeoff  is  involved. 

Our  experiments  on  adaptive  global  load  balancing  used  an  explicit  stochastic-process 
model  to  estimate  the  time  evolution  of  processor  loading  and  to  model  the  dissemination  of 
load-information.  This  model  allows  improved  estimates  to  be  made  of  system-wide 
loading,  which  allows  a  given  level  of  load  balance  to  be  achieved  with  far  fewer  object 
migrations. This in turn improves the system’s performance (in terms of consistently
meeting  latency  deadlines),  provided  that  migration  costs  are  sufficiently  high  (remember 
the  tradeoff  mentioned  above).  The  performance  improvement  achieved  and  the 
circumstances  under  which  it  can  be  achieved,  however,  were  found  to  be  seriously  limited 
by  the  unexpectedly  poor  correlation  between  load-estimate  quality  and  load  balance. 


